Streetbees Case Study (ML)

January 20th, 2021

Helena Hook

Background

One of the largest food companies in the world has engaged Streetbees to conduct Life Moments research to understand what people drink. Rather than relying on people's memory of what they eat, Streetbees has asked participants to log every meal they have for a full week by taking photos and telling us what they are drinking in the moment.

Case Study

We’ve sent you a processed subset of the overall dataset from this drink consumption survey. This data contains each submission we captured during the survey, some demographic data for each user and a few relevant questions that were used to cluster these submissions.

This data was clustered using one of our algorithms, which returned clusters for each submission under the ‘cluster_id’ column. Using any tools you find useful, please present any analysis / insight you are able to gather from this data in order to help the client understand the profiles of these clusters.

In [235]:
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
sns.set_palette('magma')
import plotly.express as px
import plotly.graph_objects as go
In [2]:
df = pd.read_csv('clean_data_streetbees.csv')
df.head()
Out[2]:
       id  user_id                     created  gender    age  who_with       where        feeling                     why_this          activity          drink  cluster_id  hour
0  mZq8EG    qkZ72  2019-07-19 11:22:11.603151    Male  18-24     Alone  At my home     Good great              To boost energy   Checking emails        Hot tea           1    11
1  qjW2Pk    RvgDz  2020-04-25 13:14:43.723203    Male  35-44     Alone  At my home  Great amazing       Other, To boost energy     Playing games  Bottled water           1    13
2  xGmZ3J    AZGl7  2020-04-09 00:15:50.964032  Female  35-44     Alone  At my home  Great amazing  To go with other food drink           Reading           Soda           8     0
3  k56pvJ    q0pl7  2020-01-19 21:43:07.280516    Male  18-24     Alone  At my home  Happy content                        Cheap       Watching TV      Diet Soda           8    21
4  r04B92    1AYVR  2020-05-09 21:31:01.238650    Male    45+     Alone  At my home  Great amazing  To go with other food drink  Watching YouTube  Bottled water           8    21
In [3]:
df.shape
Out[3]:
(1202, 13)

Datetime

Survey data was gathered over 520 days (about 1 year and 5 months), from June 2nd 2019 to November 3rd 2020, in the US.

In [4]:
df['created'] = pd.to_datetime(df['created'])
In [5]:
# Data was gathered from 2 June 2019 to 3 November 2020 in the US
df['created'].min()
Out[5]:
Timestamp('2019-06-02 04:59:14.229380')
In [6]:
df['created'].max()
Out[6]:
Timestamp('2020-11-03 14:57:19.776421')
In [7]:
# survey length
df['created'].max()-df['created'].min()
Out[7]:
Timedelta('520 days 09:58:05.547041')
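
The "1 year 5 months" figure can be reproduced from the two timestamps above. This is a minimal sketch, not part of the original analysis: the middle timestamp is only a filler row, and counting whole calendar months this way is one convention among several.

```python
import pandas as pd

# First and last values are the min/max shown above; the middle one is filler.
created = pd.to_datetime(pd.Series([
    '2019-06-02 04:59:14.229380',
    '2020-01-19 21:43:07.280516',
    '2020-11-03 14:57:19.776421',
]))

start, end = created.min(), created.max()
span = end - start
print(span.days)  # whole days between first and last submission

# Count whole calendar months between the two dates.
months = (end.year - start.year) * 12 + (end.month - start.month)
if end.day < start.day:
    months -= 1  # the last month is not complete yet
years, rem_months = divmod(months, 12)
print(years, 'year(s)', rem_months, 'month(s)')
```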

Gender

65% of survey answers come from women, almost twice as many as from men (34%); other genders account for 1%.
The gender gap is more pronounced in some clusters than in others.
Clusters 3, 6, 1, 2 and 8 have noticeably more women than men.
Clusters 5, 7, 9 and 4 are closer to an equal split.

In [8]:
# 65% of survey answers are from women
round(df['gender'].value_counts(normalize=True)*100)
Out[8]:
Female    65.0
Male      34.0
Other      1.0
Name: gender, dtype: float64
In [9]:
# name the Series 'count' before reset_index: older pandas names it 'gender', newer versions 'proportion'
gender_df = (df.groupby('cluster_id')['gender'].value_counts(normalize=True)*100).rename('count').reset_index()
gender_df['cluster_id'] = gender_df['cluster_id'].apply(str)
In [10]:
gender_df = gender_df.sort_values('count', ascending=False)
In [11]:
# plotly gender by clusters
data = gender_df
fig = px.bar(data, 
             x= 'cluster_id', 
             y= 'count',
             title='Gender differences in clusters',
             hover_data=['cluster_id'], 
             labels={'count':'% of the cluster', 'gender':'Gender'}, 
             barmode='group', 
             color=data['gender'].astype(str), 
             template = 'simple_white',
             color_discrete_map={
                 'Female':'#e78671' , 
                 'Male':'#bf5275' , 
                 'Other':'#ffd667'})
fig.show()
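
To put a number on "more women than men" per cluster, one option is a row-normalised crosstab and the female minus male gap in percentage points. The sketch below runs on a hypothetical toy frame (`toy`) that only borrows the notebook's column names; `gender_share` and `gender_gap` are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df.
toy = pd.DataFrame({
    'cluster_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'gender': ['Female', 'Female', 'Female', 'Male',
               'Female', 'Male', 'Male', 'Female'],
})

# Each row is one cluster and sums to 100%.
gender_share = pd.crosstab(toy['cluster_id'], toy['gender'], normalize='index') * 100

# Female share minus male share, in percentage points, largest gap first.
gender_gap = (gender_share['Female'] - gender_share['Male']).sort_values(ascending=False)

print(gender_share.round(1))
print(gender_gap)
```

On the real data the same two lines would be applied to `df`; clusters near a gap of 0 are the ones closest to an equal split.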

Age

38% of survey data comes from people aged 45 and older,
24% from people aged 25-34,
22% from people aged 35-44,
and 15% from the youngest group, 18-24 year olds.

The 45+ group is most prominent in clusters 1, 7, 2 and 5.
The 35-44 group is most prominent in clusters 4, 7 and 6.
25-34 year olds are most prominent in clusters 4 and 3.
Our youngest group, 18-24, stands out most in clusters 8 and 9.
Interestingly, cluster 4 has the lowest share of the oldest and youngest groups, but the highest share of 25-34 and 35-44 year olds.

In [12]:
round(df['age'].value_counts(normalize=True)*100)
Out[12]:
45+      38.0
25-34    24.0
35-44    22.0
18-24    15.0
Name: age, dtype: float64
In [13]:
age_df = (df.groupby('cluster_id')['age'].value_counts(normalize=True)*100).rename('count').reset_index()

# convert cluster_id to string for better ordering
age_df['cluster_id'] = age_df['cluster_id'].apply(str)
In [14]:
# order dataframe by values
age_df = age_df.sort_values('count', ascending=False)
In [16]:
# plotly age by clusters
data = age_df
fig = px.bar(data, 
             x= 'cluster_id', 
             y= 'count',
             title='Age differences in clusters',
             hover_data=['cluster_id'], 
             labels={'count':'% of the cluster', 'age':'Age group'}, 
             barmode='group', 
             color=data['age'].astype(str), 
             template = 'simple_white',
             color_discrete_map={
                 '45+':'#bf5265', 
                 '35-44':'#e78671',
                 '25-34':'#bf5275',
                 '18-24':'#ffd667'})
fig.show()
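
Because 45+ is the largest age group overall, a raw percentage can make it look dominant in every cluster. A common way to correct for that is an index: the group's share inside the cluster divided by its share in the whole sample, times 100. The sketch below uses hypothetical toy data, and `age_index` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df.
toy = pd.DataFrame({
    'cluster_id': [4, 4, 4, 4, 8, 8, 8, 8],
    'age': ['25-34', '25-34', '35-44', '45+',
            '18-24', '18-24', '45+', '45+'],
})

overall = toy['age'].value_counts(normalize=True)                    # share in the whole sample
within = pd.crosstab(toy['cluster_id'], toy['age'], normalize='index')  # share inside each cluster

# 100 = same share as overall, above 100 = over-represented in the cluster.
age_index = within.div(overall, axis='columns') * 100
print(age_index.round(0))
```

An index well above 100 marks a group that is over-represented in a cluster, even if it is not the largest group there.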

Who with

86% of people who logged their meals/drinks were having them alone.
Only about 12% in total were with their family or partner,
and the remaining 2% were with colleagues, friends or others.

Clusters 1, 2, 8, 6 and 9 are characterised by people who had their meal/drink alone.
Cluster 3 is made up of people who had their meal/drink with their partner, family or friends.

In [46]:
df['who_with'].value_counts(normalize=True)*100
Out[46]:
Alone         85.773710
My partner     6.572379
My family      5.574043
Colleagues     1.331115
Friends        0.665557
Other          0.083195
Name: who_with, dtype: float64
In [47]:
with_df = (df.groupby('cluster_id')['who_with'].value_counts(normalize=True)*100).rename('count').reset_index()

# convert cluster_id to string for better ordering
with_df['cluster_id'] = with_df['cluster_id'].apply(str)
In [48]:
# order dataframe by values
with_df = with_df.sort_values('count', ascending=False)
In [49]:
# plotly company by clusters
data = with_df
fig = px.bar(data, 
             x= 'cluster_id', 
             y= 'count',
             title='With who differences in clusters',
             hover_data=['cluster_id'], 
             labels={'count':'% of the cluster', 'who_with':'With who'}, 
             barmode='group', 
             color=data['who_with'].astype(str), 
             template = 'simple_white',
             color_discrete_map={
                 'Alone':'#813575', 
                 'My partner':'#af4a77',
                 'My family':'#d86870',
                 'Colleagues':'#e78671',
                 'Friends':'#eeb68d',
                 'Other': '#f4e4b8'})
fig.show()

Where

81% of people who answered the survey said they had their meal/drink at home,
9% were at school or work,
and 6% were on the go or outdoors.

Clusters 4 and 5 are made up only of people who had their meal/drink at school or work.
Clusters 1, 2, 3, 6, 8 and 9 are people who were at home.
People on the go or outdoors are found in cluster 7, as is the 'Cafe restaurant bar hotel' group.

In [51]:
round(df['where'].value_counts(normalize=True)*100)
Out[51]:
At my home                   81.0
At school work                9.0
On the go outdoors            6.0
At someone else's home        2.0
Cafe restaurant bar hotel     1.0
Other                         0.0
Name: where, dtype: float64
In [52]:
where_df = (df.groupby('cluster_id')['where'].value_counts(normalize=True)*100).rename('count').reset_index()

# convert cluster_id to string for better ordering
where_df['cluster_id'] = where_df['cluster_id'].apply(str)
In [53]:
# order dataframe by values
where_df = where_df.sort_values('count', ascending=False)
In [55]:
# plotly where by clusters
data = where_df
fig = px.bar(data, 
             x= 'cluster_id', 
             y= 'count',
             title='Where differences in clusters',
             hover_data=['cluster_id'], 
             labels={'count':'% of the cluster', 'where':'Where'}, 
             barmode='group', 
             color=data['where'].astype(str), 
             template = 'simple_white',
             color_discrete_map={
                 'At my home':'#813575',
                 'On the go outdoors':'#af4a77', 
                 'At school work':'#d86870',
                 'Cafe restaurant bar hotel':'#e78671',
                 "At someone else's home" :'#eeb68d',
                 'Other': '#f4e4b8'})
fig.show()
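
The same four steps (percentage per cluster, optional threshold, string cluster ids, sort) are repeated for every question in this notebook. They could be wrapped in one helper, as in the sketch below. `cluster_share` and `min_pct` are hypothetical names, and the toy frame only borrows the notebook's column names.

```python
import pandas as pd


def cluster_share(df, column, min_pct=0.0):
    """Share (%) of each `column` value within each cluster, largest first."""
    share = (
        df.groupby('cluster_id')[column]
        .value_counts(normalize=True)
        .mul(100)
        .rename('count')  # same column name the plots in this notebook use
        .reset_index()
    )
    # keep only answers above the threshold; cluster_id as string for plotting
    share = share[share['count'] > min_pct].assign(
        cluster_id=lambda d: d['cluster_id'].astype(str)
    )
    return share.sort_values('count', ascending=False)


# Hypothetical toy data using the notebook's column names.
toy = pd.DataFrame({
    'cluster_id': [1, 1, 1, 1, 4, 4],
    'where': ['At my home', 'At my home', 'At my home',
              'On the go outdoors', 'At school work', 'At school work'],
})
where_toy = cluster_share(toy, 'where')
print(where_toy)
```

With a helper like this, each section would reduce to something like `cluster_share(df, 'feeling', min_pct=4)` followed by the plot call.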

Feelings

Overall, about a third of people reported feeling good, great or amazing ('Good great' 16%, 'Great amazing' 15%).
The next biggest group felt just fine/neutral (11%).
At least 6% of people said they were feeling anxious/stressed, since this answer also appears combined with other feelings (for example 'Neutral fine Anxious stressed').

Most noticeably, we can see that people feeling sad/depressed are found in cluster 7.

In [87]:
round(df['feeling'].value_counts(normalize=True)*100).head(20)
Out[87]:
Good great                       16.0
Great amazing                    15.0
Neutral fine                     11.0
Sleepy tired                      9.0
Anxious stressed                  6.0
Relaxed calm                      5.0
Sad depressed                     3.0
Content upbeat                    3.0
Happy content                     2.0
Fine okay                         2.0
Other                             2.0
Excited                           1.0
Unwell sick ill                   1.0
Annoyed irritated frustrated      1.0
Motivated                         1.0
Neutral fine Anxious stressed     1.0
Busy                              1.0
Active energetic                  1.0
Rested awake                      1.0
Thirsty                           1.0
Name: feeling, dtype: float64
In [97]:
feeling_df = (df.groupby('cluster_id')['feeling'].value_counts(normalize=True)*100).rename('count').reset_index()

# only keep feelings whose share of the cluster is above 4%
feeling_df = feeling_df[feeling_df['count']>4]
In [98]:
# convert cluster_id to string for better ordering
feeling_df['cluster_id'] = feeling_df['cluster_id'].apply(str)
In [99]:
# order dataframe by values
feeling_df = feeling_df.sort_values('count', ascending=False)
In [101]:
# plotly feelings by clusters
data = feeling_df
fig = px.bar(data, 
             x= 'cluster_id', 
             y= 'count',
             title='Feeling differences in clusters',
             hover_data=['cluster_id'], 
             labels={'count':'% of the cluster', 'feeling':'Feeling'}, 
             barmode='group', 
             color=data['feeling'], 
             template = 'simple_white',
             color_discrete_map={
                 'Good great':'#813575',
                 'Great amazing':'#af4a77', 
                 'Neutral fine':'#d86870',
                 'Sleepy tired':'#e78671',
                 'Anxious stressed' :'#eeb68d',
                 'Relaxed calm': '#f4e4b8', 
                 'Sad depressed': '#3d1b62', 
                 'Content upbeat': '#b49ed9'})
fig.show()

Reasons to have a drink

The main reasons people have their drinks are to boost their energy, because they like the taste, or because it's a habit.
Other popular reasons were thirst, to go with food, being a healthy choice, and the drink's refreshing qualities.

In cluster 2 we predominantly find people who had their drink just because they liked the taste.
Routine-loving people and those who have a drink to boost their energy can be found in cluster 1.

In [108]:
round(df['why_this'].value_counts(normalize=True)*100).head(10)
Out[108]:
To boost energy                 16.0
I like the taste                14.0
It's my habit  routine          11.0
I was thirsty                    6.0
To go with other food  drink     4.0
It's a healthy choice            4.0
It's refreshing                  4.0
Other                            3.0
I had a craving                  3.0
To relax                         2.0
Name: why_this, dtype: float64
In [114]:
why_df = (df.groupby('cluster_id')['why_this'].value_counts(normalize=True)*100).rename('count').reset_index()

# only keep reasons whose share of the cluster is above 4%
why_df = why_df[why_df['count']>4]
In [115]:
# convert cluster_id to string for better ordering
why_df['cluster_id'] = why_df['cluster_id'].apply(str)
In [116]:
# order dataframe by values
why_df = why_df.sort_values('count', ascending=False)
In [153]:
# plotly why_this by clusters
data = why_df
fig = px.bar(data, 
             x= 'cluster_id', 
             y= 'count', 
             color_discrete_sequence= px.colors.sequential.Sunsetdark_r,
             title='Why this drinks in clusters',
             hover_data=['cluster_id'], 
             labels={'count':'% of the cluster', 'why_this':'Why this'}, 
             barmode='group', 
             color=data['why_this'], 
             template = 'simple_white')
fig.show()

Activity

The most popular activity while having a drink is watching TV, at 23%.
It is followed by doing nothing (9%), browsing social media (8%), working/studying (7%) and relaxing (6%).

People in clusters 6, 8 and 9 are likely to be watching TV when having a drink.
People in clusters 4 and 5, on the other hand, are the ones working or studying.
Every single cluster has some people who are browsing social media.

In [141]:
round(df['activity'].value_counts(normalize=True)*100).head(10)
Out[141]:
Watching TV                           23.0
Nothing                                9.0
Browsing social media                  8.0
Working studying                       7.0
Relaxing                               6.0
Other                                  4.0
Listening to music                     3.0
Browsing social media, Watching TV     2.0
Cooking                                2.0
Housework                              2.0
Name: activity, dtype: float64
In [142]:
act_df = (df.groupby('cluster_id')['activity'].value_counts(normalize=True)*100).rename('count').reset_index()

# only keep activities whose share of the cluster is above 4%
act_df = act_df[act_df['count']>4]
In [143]:
# convert cluster_id to string for better ordering
act_df['cluster_id'] = act_df['cluster_id'].apply(str)
In [144]:
# order dataframe by values
act_df = act_df.sort_values('count', ascending=False)
In [145]:
# plotly activity by clusters
data = act_df
fig = px.bar(data, 
             x= 'cluster_id', 
             y= 'count', 
             color_discrete_sequence= px.colors.sequential.Sunsetdark_r,
             title='Other activity while having their drink in clusters',
             hover_data=['cluster_id'], 
             labels={'count':'% of the cluster', 'activity':'Activity'}, 
             barmode='group', 
             color=data['activity'], 
             template = 'simple_white')
fig.show()
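
Some answers hold several options in one string, for example 'Browsing social media, Watching TV', so they are counted as their own category above. If each option should count separately, the strings can be split and exploded first. This is a sketch on hypothetical toy data, assuming that ', ' is the separator between options; an option whose own label contains a comma would be split incorrectly.

```python
import pandas as pd

# Hypothetical toy data; 'b' selected two activities in one answer.
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'activity': ['Watching TV', 'Browsing social media, Watching TV', 'Cooking'],
})

# One row per (submission, activity) pair.
exploded = (
    toy.assign(activity=toy['activity'].str.split(', '))
       .explode('activity')
)

# Share of all mentions (sums to 100%).
mention_share = exploded['activity'].value_counts(normalize=True) * 100

# Share of submissions that mention the activity (can sum to more than 100%).
submission_share = exploded['activity'].value_counts() / len(toy) * 100

print(mention_share)
print(submission_share.round(1))
```

Which of the two denominators to report depends on the question being asked: mentions describe the mix of activities, submissions describe how many drinking moments involve each one.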

Drinks in clusters

Only drinks that make up more than 5% of a cluster are plotted, so the notes below refer to those drinks.
Our most popular drink, 'Hot coffee', clearly defines cluster 1, together with 'Hot tea'.
Hot tea, on the other hand, is most popular in cluster 9, with no hot coffee in sight.
Soda is drunk across all clusters.
Diet soda drinkers define cluster 5.
Clusters 1 and 9 show no water, be it bottled or tap.
Beer and cider drinkers can only be found in cluster 9.

In [182]:
all_drinks = (df.groupby('cluster_id')['drink'].value_counts(normalize=True)*100).rename('count').reset_index()
In [183]:
# convert cluster_id to string for better ordering
all_drinks['cluster_id'] = all_drinks['cluster_id'].apply(str)
In [184]:
# only keep drinks whose share of the cluster is above 5%
all_drinks = all_drinks[all_drinks['count']>5]
In [185]:
# order dataframe by values
all_drinks = all_drinks.sort_values('count', ascending=False)
In [188]:
# plotly
data = all_drinks

fig = px.bar(data, x='cluster_id', y='count',
             hover_data=['cluster_id'], 
             labels={'drink':'Drink', 'count':'% in the cluster'}, 
             barmode='group', 
             color=data['drink'].astype(str), 
             template = 'simple_white',
             color_discrete_sequence= px.colors.sequential.Sunsetdark, 
             height=800)
fig.show()
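
Statements such as "no water" or "only in cluster 9" are safest when checked on the unfiltered shares, because the chart drops everything at or below 5%. The sketch below shows one way to do that on hypothetical toy data; `drink_share`, `water_share` and `no_water` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names.
toy = pd.DataFrame({
    'cluster_id': [1, 1, 1, 9, 9, 9, 5, 5],
    'drink': ['Hot coffee', 'Hot coffee', 'Hot tea',
              'Hot tea', 'Beer', 'Hot tea',
              'Diet Soda', 'Bottled water'],
})

# Unfiltered share of each drink within each cluster (columns sum to 100%).
drink_share = pd.crosstab(toy['drink'], toy['cluster_id'], normalize='columns') * 100

# Combined share of both water types; reindex adds a zero row if a drink never appears.
water = ['Bottled water', 'Tap water']
water_share = drink_share.reindex(water, fill_value=0).sum()

# Clusters with no water submissions at all.
no_water = water_share[water_share == 0].index.tolist()
print(water_share)
print(no_water)
```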

Why did people have a specific drink?

People drink coffee first of all to boost energy, then because it's part of their routine and because they like the taste.
Soda is drunk because people like the taste and to boost energy.
Bottled and tap water are both drunk when people are feeling thirsty.

In [299]:
drink_why_df = pd.DataFrame(df.groupby(['drink', 'why_this']).count()).reset_index()
# choose specific columns
drink_why_df = drink_why_df[['drink', 'why_this', 'id']].rename(columns={'id':'count'})
# sort values
drink_why_df = drink_why_df.sort_values('count', ascending=False)
In [302]:
# only keep drink-why pairs where the pair appears more than 10 times
drink_why_df = drink_why_df[drink_why_df['count']>10]
In [306]:
import plotly.io as pio
In [308]:
# plotly drinks with reason
data = drink_why_df
fig = px.bar(data, 
             x= 'drink', 
             y= 'count', 
             color_discrete_sequence= px.colors.sequential.Sunsetdark_r,
             title='Reasons for choosing each drink (all clusters)',
             hover_data=['why_this'], 
             labels={'count':'Count', 'why_this':'Reason', 'drink':'Drink'}, 
             barmode='group', 
             color=data['why_this'], 
             template = 'simple_white')
fig.show()

pio.write_html(fig, file='index.html', auto_open=True)
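
The chart above uses raw counts, so frequent drinks such as coffee dominate it. To compare reasons between drinks, the counts can be normalised within each drink. A minimal sketch on hypothetical toy data follows; `share`, `tidy` and `top_reason` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names.
toy = pd.DataFrame({
    'drink': ['Hot coffee'] * 3 + ['Tap water'] * 2 + ['Soda'] * 2,
    'why_this': ['To boost energy', 'To boost energy', 'I like the taste',
                 'I was thirsty', 'I was thirsty',
                 'I like the taste', 'I like the taste'],
})

# Count each (drink, reason) pair, then divide by the total for that drink.
counts = toy.groupby(['drink', 'why_this']).size()
share = counts / counts.groupby(level='drink').transform('sum') * 100

# Most common reason per drink: sort by share and keep the first row of each drink.
tidy = share.rename('pct').reset_index()
top_reason = (
    tidy.sort_values('pct', ascending=False)
        .drop_duplicates('drink')
        .set_index('drink')['why_this']
)
print(top_reason)
```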

When do people drink

It seems people have most of their drinks between 12 pm and 6 pm.

In [305]:
# plot the hour of day when people drink
plt.figure(figsize=(12,6))
sns.set_style('white')

q = df['hour'].value_counts(normalize=True).sort_index()*100
y = q.values
x = q.index

ax = sns.barplot(x=x, y=y, palette='magma')
sns.despine(top=True, right=True)

# label each bar 
for p in ax.patches:
    height = p.get_height() # get the height of each bar
    # adding text to each bar
    ax.text(x=p.get_x() + p.get_width() / 2,  # x position: centre of the bar
            y=height + 0.5,                   # y position: 0.5 above the top of the bar
            s='{:.0f}%'.format(height),       # data label, rounded to a whole percent
            ha='center',                      # centre the label horizontally
            fontsize=13)

plt.xlabel('Hour of the day', fontsize=14)
plt.ylabel('% of when people drink', fontsize=14)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.title('When do people drink?', fontsize=14);
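
The share of drinks inside a time window can also be computed directly instead of being read off the chart. The sketch below uses a hypothetical list of hours and assumes that "12 pm to 6 pm" means hours 12 through 17 inclusive; the daypart boundaries are likewise an assumption.

```python
import pandas as pd

# Hypothetical values standing in for df['hour'].
hours = pd.Series([8, 9, 12, 13, 15, 17, 18, 21, 23, 14], name='hour')

# Share of submissions logged from 12:00 up to 17:59.
afternoon_share = hours.between(12, 17).mean() * 100
print(afternoon_share)

# Group hours into dayparts; right=False makes each bin [start, end).
dayparts = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=['Night', 'Morning', 'Afternoon', 'Evening'],
)
daypart_share = dayparts.value_counts(normalize=True) * 100
print(daypart_share)
```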

Most popular drinks

Without question, the most popular drink of choice is hot coffee.
This is followed by soda (carbonated soft drink), hot tea, bottled water and tap water.

In [173]:
# plot drinks 
plt.figure(figsize=(12,10))
sns.set_style('white')

q = df['drink'].value_counts(normalize=True, ascending=False)*100
x = q.values
y = q.index

sns.barplot(x=x, y=y, palette='magma_r', orient='h')
sns.despine(top=True, right=True)

plt.xlabel('% of what people drink', fontsize=14)
plt.ylabel('Drink', fontsize=14)
plt.xticks(fontsize=13, rotation=90)
plt.yticks(fontsize=13)
plt.title('What do people drink?', fontsize=14);

Create a wordcloud

In [69]:
# cluster_id was converted to string above, so compare against '1'
all_drinks[all_drinks['cluster_id']=='1']
Out[69]:
   cluster_id       drink      count
0           1  Hot coffee  46.370968
1           1     Hot tea  16.935484
2           1        Soda   8.064516
In [70]:
d = {}
for i in range(1,10):
    # cluster_id was converted to string above, so compare against str(i)
    by_cluster = all_drinks[all_drinks['cluster_id']==str(i)]
    d[i] = by_cluster.set_index('drink')['count'].to_dict()
In [71]:
d[1].values()
Out[71]:
dict_values([46.37096774193548, 16.93548387096774, 8.064516129032258])
In [78]:
from wordcloud import WordCloud
In [92]:
wordcloud = WordCloud(max_font_size=50, background_color="white")
for i in range(1, 10):
    wordcloud.generate_from_frequencies(frequencies=d[i])
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title('Cluster: {}'.format(i), fontsize=14)
    plt.show()
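
The per-cluster frequency dictionaries can also be built in one pass with `groupby`, which avoids comparing `cluster_id` against a hard-coded range. This is a sketch on a toy frame shaped like `all_drinks`; the cluster 1 values are rounded from the output above and the cluster 9 value is hypothetical.

```python
import pandas as pd

# Toy frame shaped like all_drinks, with cluster_id already stored as a string.
all_drinks_toy = pd.DataFrame({
    'cluster_id': ['1', '1', '9'],
    'drink': ['Hot coffee', 'Hot tea', 'Hot tea'],
    'count': [46.4, 16.9, 30.0],  # 30.0 is a made-up value for illustration
})

# One {drink: share} dictionary per cluster, keyed by the cluster id as stored.
freqs = {
    cluster: grp.set_index('drink')['count'].to_dict()
    for cluster, grp in all_drinks_toy.groupby('cluster_id')
}
print(freqs)
```

Each `freqs[cluster]` has the shape that `generate_from_frequencies` is given in the loop above.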